Module 12: Spatial Data with Simple Features
The Quarto file used to generate the html file can be obtained by clicking on the Code Links beneath the Table of Contents which will open the Math 3376 Posit Cloud Project where you can open the file 12-Spatial-Data-with-Simple-Features.qmd.
Introduction
This document is intended to help explore how to perform basic manipulation and visualization of spatial data that can be represented using simple features. We will primarily focus on using the sf package for our analysis.
Wrangling and Visualizing Spatial Data in R
Spatial data can take many forms.
For a data scientist, spatial data is usually data for which each observation is geographic region or location.
Simple Features (officially Simple Feature Access) is a set of standards that specify a common storage and access model of geographic feature made of mostly two-dimensional geometries (point, line, polygon, multi-point, multi-line, etc.) used by geographic information systems. It is formalized by both the Open Geospatial Consortium (OGC) and the International Organization for Standardization (ISO). Wikipedia
Simple features are comprised of:
- A geometry object (e.g., a point, a polygon, a line, etc.)
- Attribute data associated with the geometry object (e.g., the temperature at a certain time on a certain day, the number of new cases of a disease in a county in the last month).
Simple Features
As stated in the vignette to the sf package:
Features have a geometry describing where on Earth the feature is located, and they have attributes, which describe other properties. The geometry of a tree can be the delineation of its crown, of its stem, or the point indicating its centre. Other properties may include its height, color, diameter at breast height at a particular date, and so on.
Just to clarify:
- The geometry of an observation describes where the object is located.
- A geometry can be a point, a polygon (which is really just an ordered collection of points), or something more complicated.
- The attributes of a geometry object are what data scientists would usually consider the “data”.
All geometries are made of points, which can be combined to create more and more complex objects. A point can be 2-, 3-, or 4-dimensional.
The most common kinds of points are two-dimensional and are described using a 2-dimensional set of coordinates (X and Y), e.g., longitude and latitude or easting and northing.
- 2-dimensional points are referred to as XY points.
A 3-dimensional point could include the X and Y coordinates as well as the altitude of the 2-dimensional point in 3-dimensional space.
- The Z coordinate is another dimension denoting altitude of the point.
- Combining a Z coordinate with an XY coordinate results in a 3-dimensional XYZ point.
Alternatively, a 3-dimensional point could include some other measure associated with the point.
- The M coordinate is another dimension denoting some measure associated with the point.
- It’s pretty rare, but examples include time or measurement error.
- Combining an M coordinate with an XY coordinate results in a 3-dimensional XYM point.
Combining X, Y, Z, and M coordinates results in a 4-dimensional point.
Packages and Tools for Working with Simple Features
In what follows, we will use several packages.
- The
sfpackage is an R package for working with simple features (sfobjects) both in terms of the geometry objects and the associated attributes.- The
sfpackage can import, manipulate, and plotsfobjects. - The
sfpackage is intended to supersede thesppackage, which is an older R package for working with spatial data. - Since the
sppackage is being superseded by thesfpackage, I recommend working with thesfpackage for spatial data analysis moving forward.
- The
Because of how sf objects are represented in R, simple features (once constructed or imported) can be manipulated and plotted by many other well-known packages.
- The
dplyrpackage can be used to manipulatesfobjects, and we will utilize it as certain times. - The
ggplot2package can also be used to plotsfobjects, which we will demonstrate.
Choosing a color palette to represent the attributes of our simple features is very important.
- The
grDevicespackage included withbaseR provides many color palettes.- The traditional palettes are
rainbow,heat.colors,terrain.colors,topo.colors, andcm.colors. - The
hcl.colorspalette function was added in R 4.0.0 and provides numerous excellent color palettes.
- The traditional palettes are
Running the Examples in the documentation found in
?grDevices::hcl.colorswill show examples of the available palettes throughgrDevices.We provide color swatches for many palettes at the end of this tutorial (without the code).
The
colorspacepackage provides many of the same HCL palettes available through thehcl.colorspalette, but it also provides convenient functions for accessing this inggplot2.
Loading Required Packages
We load all of these packages below, with the exception of colorspace, as it has a coords function that masks (i.e., replaces) a needed function in the sf package. So we will access the necessary colorspace function using the :: approach.
# library(colorspace)
library(sf)
library(dplyr)
library(ggplot2)Simple Feature Geometry Objects
The most common simple feature geometry objects used by data scientists are:
| geometry | function | description |
|---|---|---|
POINT |
sf::st_point |
A geometry containing a single point |
POLYGON |
sf::st_polygon |
A geometry with a sequence of points that form a closed ring that doesn’t intersect itself. Multiple rings form outer rings and holes within the polygon. |
We go through the process of creating and plotting these geometries below.
# create an XY point
(p1 <- st_point(c(1,2)))# create an XYZ point
(p2 <- st_point(c(1,2,3)))# the points look the same when plotted in two dimensions
plot(p1)# 3-dim point plotted in 2-dim
plot(p2)- The polygon
outerstarts at \((0,0)\), then goes to \((10,0)\), followed by \((10,10)\), then \((0,10)\), and finally stop back at \((0,0)\) - The polygon
hole1starts at \((1,1)\), then goes to \((1,3)\), followed by \((2,2)\), then \((2,1)\), and finally stop back at \((1,1)\) - The polygon
hole2starts at \((5,5)\), then goes to \((5,6)\), followed by \((6,6)\), then \((7,5)\), and finally stop back at \((5,5)\)
# create a ring (connected set of points)
outer <- matrix(c(0, 0, 10, 0, 10, 10, 0, 10, 0, 0), ncol = 2, byrow = TRUE)
# create additional rings for holes
hole1 <- matrix(c(1, 1, 1, 3, 2, 2, 2, 1, 1, 1), ncol = 2, byrow = TRUE)
hole2 <- matrix(c(5, 5, 5, 6, 6, 6, 7, 5, 5, 5), ncol = 2, byrow = TRUE)
# combine rings into a list
pts <- list(outer, hole1, hole2)
# turn list of rings into a polygon
(pl1 <- st_polygon(pts))# plot polygon
plot(pl1)Question 1
Use the code cell below to construct and display a pentagon that has a triangular hole inside.
Solution to Question 1
# create polygonCommon Geometry Objects
The other common geometry objects are:
| geometry | function | description |
|---|---|---|
LINESTRING |
sf::st_linestring |
A sequence of points that is connected with straight, non-self intersecting lines |
MULTIPOINT |
sf::st_multipoint |
A set of points |
MULTIPOLYGON |
sf::st_multipolygon |
A set of polygons |
MULTILINESTRING |
sf::st_multilinestring |
A set of line strings |
GEOMETRYCOLLECTION |
sf::st_geometrycollection |
A set of the other geometries (except itself) |
Examples of Geometry Objects
We provide examples of creating and plotting these geometries below.
# create a matrix of multiple points
(pts <- matrix(rnorm(10), ncol = 2)) [,1] [,2]
[1,] -0.2942841 0.4042901
[2,] -0.5631947 -2.8190104
[3,] 0.3776016 -1.3146976
[4,] -0.1430829 1.4013844
[5,] 0.2235321 -1.9659875
# convert matrix of points to linestring
(ls1 <- st_linestring(pts))
plot(ls1)# convert matrix of points to multipoints
(mp1 <- st_multipoint(pts))
plot(mp1)# create multipolygons
pol1 <- list(outer, hole1, hole2)
pol2 <- list(outer + 24)
mp <- list(pol1, pol2)
(mpl1 <- st_multipolygon(mp))
plot(mpl1, axes = TRUE)# create a multilinestring
(pts2 <- matrix(rnorm(6), ncol = 2)) [,1] [,2]
[1,] 1.189007 -1.0965091
[2,] 1.721922 -0.7875281
[3,] 0.918795 0.6693628
(ml1 <- st_multilinestring(list(pts, pts2)))
plot(ml1)# create a geometry collection
(gc <- st_geometrycollection(list(p1, pl1, ls1)))
plot(gc, col = "grey")How Do I Know What Type of Geometry I Need?
- Often, this is determined automatically when reading in shapefiles (which we’ll discuss later).
- Attributes observed at a single location require a
POINT. - Most regions can be represented by a
POLYGON. - Complicated objects made of regions, e.g., an island chain like Hawaii, require
MULTIPOLYGONS. - The other types are for more complicated objects.
- There are 10 other rarer geometry types that we do not discuss (
CIRCULARSTRING,CURVE,SURFACE,TRIANGLE,COMPOUNDCURVE,CURVEPOLYGON,MULTICURVE,MULTISURFACE,POLYHEDRALSURFACE,TIN). We can learn about them through the additional resources provided at the end of this document.
Coordinate Reference Systems
A coordinate reference system (CRS) must be provided in order to place a point on the earth’s surface.
When we import a geometry object from file, the CRS will often be provided.
- The WGS84 CRS is often the default for longitude/latitude coordinates.
Sometimes, in order to combine geometry objects, we must specify the CRS of a geometry object.
There are many CRSs. A CRS is often used because it is known to have a certain desirable property. A discussion of CRSs is beyond the scope of this tutorial. And when you do need to know about CRSs, it will probably be so specific that a general discussion won’t help. However, here are a few references related to CRSs that may provide a bit more detail:
Constructing Simple Feature (sf) Objects
An sf object is a data.frame that has a simple feature geometry list column (i.e., a column of geometry objects). So we can work with sf objects similar to how we would work with a data.frame object, though it may have different default behaviors because it has been extended to an sf object.
- The geometry-list column contains the geometry object for each observation.
- The other columns of the
sfobject contain the attributes of the geometry object. - Each row of the
sfobject is a simple feature. Alternatively, each observation is a simple feature.
In practice, sf objects are often initially created by reading in a shapefile (more on that later). However, particularly for POINT observations, we may need to create an sf object manually.
In what follows, we create sf objects for POINT geometry objects and POLYGON geometry objects.
- We can use the same previously discussed functions (e.g.,
sf::st_point,sf::st_polygon, etc.) to create the geometry objects. - The
sf::st_sfcfunction is used to create a geometry list column. - The
sf::st_sffunction is used to combined adata.frameof attributes with the geometry list column.
# create POINT objects
pt1 <- st_point(c(0, 0))
pt2 <- st_point(c(0, 1))
pt3 <- st_point(c(1, 1))
# create geometry list column
glc1 <- st_sfc(list(pt1, pt2, pt3))
# is glc1 a list?
is.list(glc1)[1] TRUE
# what class is glc1
class(glc1)[1] "sfc_POINT" "sfc"
# create attribute data.frame with temperature and rainfall attributes
df1 <- data.frame(temperature = c(10, 11, 10.4), rainfall = c(1, 1.3, 0.9))
# create sf object
sf1 <- st_sf(df1, geometry = glc1)
# class of sf1
class(sf1)[1] "sf" "data.frame"
# plot sf1
plot(sf1["rainfall"], pch = 20, axes = TRUE)# create polygon objects
outer1 <- matrix(c(0, 0, 1, 1, 0, 0, 1, 1, 0, 0), ncol = 2)
pl1 <- st_polygon(list(outer1))
# outer2 is outer1 shifted 1 unit to the right
outer2 <- matrix(c(1, 1, 2, 2, 1, 0, 1, 1, 0, 0), ncol = 2)
pl2 <- st_polygon(list(outer2))
# create geometry list columns
glc2 <- st_sfc(list(pl1, pl2))
# what class is glc2
class(glc2)[1] "sfc_POLYGON" "sfc"
# create second sf object (only include an attribute column and geometry)
sf2 <- st_sf(cases = c(10, 12), geometry = glc2)
# class of sf2
class(sf2)[1] "sf" "data.frame"
# plot sf2
plot(sf2)# what happens if you combine geometry types?
glc3 <- st_sfc(list(pt1, pt2, pt3, pl1, pl2))
# what class is glc3
class(glc3)[1] "sfc_GEOMETRY" "sfc"
# create sf3
sf3 <- st_sf(attribute1 = rnorm(5), geometry = glc3)
# class of sf3
class(sf3)[1] "sf" "data.frame"
# plot sf3
plot(sf3, pch = 20, pal = topo.colors)# plot sf3 with even breaks
plot(sf3,
pch = 20,
pal = topo.colors,
breaks = seq(min(sf3$attribute1), max(sf3$attribute1), length = 12)
)Importing Shapefiles as sf Objects
A data scientist is most likely to work with sf objects obtained from importing a shapefile into R.
ArcGIS defines a shapefile as:
A shapefile is a simple, nontopological format for storing the geometric location and attribute information of geographic features. Geographic features in a shapefile can be represented by points, lines, or polygons (areas). What is a shapefile?
Generally, a shapefile can be imported into R as an sf object.
- Each row is an observation related to a geometry object.
- There should be a
geometrycolumn that contains the geometry object for each observation (this is the geometry list column). - The other columns will represent the attributes associated with each geometry object.
Shapefiles are widely available for describing many different spatial objects like counties, census tracts, zip code tabulation areas (ZCTAs), states, countries, etc. There are even shapefiles that can be used to describe other spatial objects like roads, rivers, lakes, etc.
- A simple web search with appropriate terms will usually bring up a website with a relevant shapefile for our data.
- e.g., search “usa shapefile” brings up a Census bureau page with many different shapefiles for different areas and characteristics of the United States.
- A “shapefile” is often a zipped folder that contains many files inside it.
- The
.shpfile is usually the file we want to import.
We can import that shapefile into R as an sf object using the st_read function.
- Typically, we want to provide the
shpfile to thedsnargument (data source name) ofst_read.
The file cb_2023_us_state_20m.zip containing a shapefile of the U.S. states has already been downloaded and unzipped into the /data folder in our current working directory which /cloud/project.
- The current working directory is the location to which R currently reads or saves files.
- We can learn what our current working directory is by running
getwd()in the Console. - If we want to change the working directory for all code chunks, we can set it via a
setupcode chunk in the beginning of our document. - If we we want to change the working directory for a single code chunk, we can use the
setwd("path")function, wherepathis the directory path we want to set as our working directory.
In our case, we want to read the file cb_2023_us_state_20m.shp in the cb_2023_us_state_20m folder in the data folder of our current working directory.
We can run the following command to import the desired shapefile.
- The
dsnargument “data/cb_2023_us_state_20m/cb_2023_us_state_20m.shptells R to look for thecb_2023_us_state_20m.shpfile in the file pathdata/cb_state_2023_us_state_20m.
######################################################
# if you are running RStudio locally on your computer
# you will need to download and unzip the folder
# below to access the shape file
######################################################
#download.file("https://github.com/CU-Denver-MathStats-OER/Data-Wrangling-and-Visualization/raw/main/Data/cb_2023_us_state_20m.zip", destfile = "./cb_2023_us_state_20m.zip", method = "auto")
#unzip("cb_2023_us_state_20m.zip", exdir = "./data/cb_2023_us_state_20m")us_sf <- st_read(dsn = "./data/cb_2023_us_state_20m/cb_2023_us_state_20m.shp")Reading layer `cb_2023_us_state_20m' from data source
`/Users/adamspiegler/Dropbox/CUDenver/Math3376/Notes/Posit-Cloud/data/cb_2023_us_state_20m/cb_2023_us_state_20m.shp'
using driver `ESRI Shapefile'
Simple feature collection with 52 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -179.1743 ymin: 17.91377 xmax: 179.7739 ymax: 71.35256
Geodetic CRS: NAD83
The st_read provides helpful information automatically when run. In this case, we learn:
- We imported an
sfobject with 52 features (observations). - The imported
sfobject has 9 attributes (columns). - The features appear to be the simple feature geometry
MULTIPOLGYON. - The dimension is
XY, so we are working with 2-dimensional data. - The bounding box tells us the largest and smallest y-coordinates.
- The CRS is NAD83.
We use the class function to see the classes of us_sf.
class(us_sf)[1] "sf" "data.frame"
us_sf is a data.frame that has been extended to the sf class.
Let’s use the str function to learn more about the structure of us_sf.
str(us_sf)Classes 'sf' and 'data.frame': 52 obs. of 10 variables:
$ STATEFP : chr "48" "06" "21" "13" ...
$ STATENS : chr "01779801" "01779778" "01779786" "01705317" ...
$ GEOIDFQ : chr "0400000US48" "0400000US06" "0400000US21" "0400000US13" ...
$ GEOID : chr "48" "06" "21" "13" ...
$ STUSPS : chr "TX" "CA" "KY" "GA" ...
$ NAME : chr "Texas" "California" "Kentucky" "Georgia" ...
$ LSAD : chr "00" "00" "00" "00" ...
$ ALAND : num 6.77e+11 4.04e+11 1.02e+11 1.49e+11 1.40e+11 ...
$ AWATER : num 1.90e+10 2.03e+10 2.38e+09 4.42e+09 2.93e+10 ...
$ geometry:sfc_MULTIPOLYGON of length 52; first list element: List of 1
..$ :List of 1
.. ..$ : num [1:551, 1:2] -107 -107 -107 -107 -106 ...
..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
- attr(*, "sf_column")= chr "geometry"
- attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
..- attr(*, "names")= chr [1:9] "STATEFP" "STATENS" "GEOIDFQ" "GEOID" ...
us_sf has 52 rows and 10 columns. The (most) useful columns are:
STUSPS: the state abbreviationNAME: the state nameALAND: the land area of each stateAWATER: the water area of each state
The geometry column is the simple feature geometry list column and contains the geometry object for each observation.
Wrangling sf Objects
An sf object is a type of data frame, similar to how a tibble is a type of data frame. Both classes extend the data.frame class.
- This means that we can work with
sfobjects similarly to how we would work with adata.frame, though the default behaviors may be different. - We can use the
dplyrpackage or similar tools to manipulate ansfobject.
We can select columns of the us_sf sf object in the following ways:
# a vector
us_sf$STUSPS [1] "TX" "CA" "KY" "GA" "WI" "OR" "MO" "VA" "TN" "LA" "NY" "ID" "FL" "IL" "MT"
[16] "MN" "MD" "IA" "DC" "WA" "SD" "OH" "NE" "IN" "MA" "NV" "ND" "AR" "MS" "CO"
[31] "NC" "UT" "OK" "WY" "WV" "SC" "ME" "HI" "AL" "KS" "RI" "CT" "MI" "AK" "DE"
[46] "NM" "PR" "VT" "NJ" "PA" "NH" "AZ"
# these are all subsets of an sf object which is an sf object
#us_sf[,5]
#us_sf[,"STUSPS"]
#us_sf["STUSPS"]
us_sf |> select(STUSPS)Simple feature collection with 52 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -179.1743 ymin: 17.91377 xmax: 179.7739 ymax: 71.35256
Geodetic CRS: NAD83
First 10 features:
STUSPS geometry
1 TX MULTIPOLYGON (((-106.6234 3...
2 CA MULTIPOLYGON (((-118.594 33...
3 KY MULTIPOLYGON (((-89.54443 3...
4 GA MULTIPOLYGON (((-85.60516 3...
5 WI MULTIPOLYGON (((-86.93428 4...
6 OR MULTIPOLYGON (((-124.5524 4...
7 MO MULTIPOLYGON (((-95.76565 4...
8 VA MULTIPOLYGON (((-76.02348 3...
9 TN MULTIPOLYGON (((-90.3007 35...
10 LA MULTIPOLYGON (((-94.04305 3...
Note that the $ operator extracts the column from us_sf (returning a character vector), while the other choices subset us_sf and remain sf objects (i.e., the geometry list column is retained).
We can select rows in a similar fashion.
#us_sf[2:3,]
#us_sf[us_sf$STUSPS == "CO",]
#us_sf[startsWith(us_sf$STUSPS, "C"),]
us_sf |> filter(startsWith(us_sf$STUSPS, "C"))Simple feature collection with 3 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -124.4096 ymin: 32.53416 xmax: -71.78936 ymax: 42.04964
Geodetic CRS: NAD83
STATEFP STATENS GEOIDFQ GEOID STUSPS NAME LSAD ALAND
1 06 01779778 0400000US06 06 CA California 00 403673296401
2 08 01779779 0400000US08 08 CO Colorado 00 268418756810
3 09 01779780 0400000US09 09 CT Connecticut 00 12541750274
AWATER geometry
1 20291770234 MULTIPOLYGON (((-118.594 33...
2 1185758065 MULTIPOLYGON (((-109.06 38....
3 1816364426 MULTIPOLYGON (((-73.72777 4...
Note that startsWith is a base R function that finds the elements of a character vector that start with a certain set of characters while start_with is a dplyr function that is used to select columns of a data frame based on a pattern.
A really neat feature of sf objects is that can use a spatial object to select rows. Let’s extract the “Colorado” row from us_sf.
co <- us_sf[us_sf$NAME == "Colorado",]
class(co)[1] "sf" "data.frame"
If we pass the co object as the row argument inside the square brackets, [ ], then the rows of us_sf with geometry objects that intersect the co geometry object will be returned
(co_intersects <- us_sf[co,])Simple feature collection with 8 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -114.8142 ymin: 31.33224 xmax: -94.43152 ymax: 45.0059
Geodetic CRS: NAD83
STATEFP STATENS GEOIDFQ GEOID STUSPS NAME LSAD ALAND
23 31 01779792 0400000US31 31 NE Nebraska 00 198949602728
30 08 01779779 0400000US08 08 CO Colorado 00 268418756810
32 49 01455989 0400000US49 49 UT Utah 00 213921849163
33 40 01102857 0400000US40 40 OK Oklahoma 00 177664484361
34 56 01779807 0400000US56 56 WY Wyoming 00 251458162746
40 20 00481813 0400000US20 20 KS Kansas 00 211753777631
46 35 00897535 0400000US35 35 NM New Mexico 00 314198587197
52 04 01779777 0400000US04 04 AZ Arizona 00 294366106734
AWATER geometry
23 1379309601 MULTIPOLYGON (((-104.0531 4...
30 1185758065 MULTIPOLYGON (((-109.06 38....
32 5963196691 MULTIPOLYGON (((-114.0525 3...
33 3373395450 MULTIPOLYGON (((-103.0025 3...
34 1868053273 MULTIPOLYGON (((-111.0569 4...
40 1345707708 MULTIPOLYGON (((-102.0517 4...
46 726463919 MULTIPOLYGON (((-109.0492 3...
52 854003932 MULTIPOLYGON (((-114.7997 3...
plot(co_intersects["NAME"])plot(st_geometry(co_intersects))Question 2
Create a plot of all states the border Colorado or California.
Solution to Question 2
# create plot of all states that border Colorado or CaliforniaThe base merge and dplyr *_join functions can be used to merge a data frame with an sf object.
- The
sfobject must be the first argument of these functions.
Let’s access a COVID-19 related data frame available in the bayesutils package, which can be installed from GitHub.
- We install the package from GitHub using the
remote::install_githubfunction.- Be sure to install the
remotespackage if you don’t have it.
- Be sure to install the
- We then load the
covid_20210307data set from thebayesutilspackage.
#######################################################
# run each of the commands below in the Console below.
#######################################################
#install.packages("remotes")
#remotes::install_github("jfrench/bayesutils")data("covid_20210307", package = "bayesutils")str(covid_20210307)'data.frame': 50 obs. of 9 variables:
$ state_name : chr "Alaska" "Alabama" "Arkansas" "Arizona" ...
$ state_abb : chr "AK" "AL" "AR" "AZ" ...
$ deaths : num 305 10148 5319 16328 54124 ...
$ cases : num 56886 499819 324818 826454 3501394 ...
$ population : num 738516 4864680 2990671 6946685 39148760 ...
$ income : num 35455 25734 25359 29348 31086 ...
$ hs : num 91 82.1 82.9 85.6 80.7 89.7 88.6 87.7 85.5 84.3 ...
$ bs : num 27.9 21.9 19.5 25.9 30.1 36.4 35.5 27.8 25.8 27.3 ...
$ vote_diff_2020: num -10.5 -25.8 -28.4 0.3 29.8 13.9 20.4 19.3 -3.4 0.2 ...
- attr(*, "na.action")= 'omit' Named int [1:6] 4 9 13 28 43 51
..- attr(*, "names")= chr [1:6] "4" "9" "13" "28" ...
The state_abb column of covid_20210307 has the state abbreviations and matches the STUSPS column of us_sf.
We use the base::merge function to unite these two objects into a new object, covid_us.
covid_us <- merge(us_sf, covid_20210307,
by.x = "STUSPS", by.y = "state_abb")
head(covid_us)Simple feature collection with 6 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -179.1743 ymin: 30.22831 xmax: 179.7739 ymax: 71.35256
Geodetic CRS: NAD83
STUSPS STATEFP STATENS GEOIDFQ GEOID NAME LSAD ALAND
1 AK 02 01785533 0400000US02 02 Alaska 00 1.479017e+12
2 AL 01 01779775 0400000US01 01 Alabama 00 1.311850e+11
3 AR 05 00068085 0400000US05 05 Arkansas 00 1.346605e+11
4 AZ 04 01779777 0400000US04 04 Arizona 00 2.943661e+11
5 CA 06 01779778 0400000US06 06 California 00 4.036733e+11
6 CO 08 01779779 0400000US08 08 Colorado 00 2.684188e+11
AWATER state_name deaths cases population income hs bs
1 245347100126 Alaska 305 56886 738516 35455 91.0 27.9
2 4582326383 Alabama 10148 499819 4864680 25734 82.1 21.9
3 3122251184 Arkansas 5319 324818 2990671 25359 82.9 19.5
4 854003932 Arizona 16328 826454 6946685 29348 85.6 25.9
5 20291770234 California 54124 3501394 39148760 31086 80.7 30.1
6 1185758065 Colorado 5989 436602 5531141 35053 89.7 36.4
vote_diff_2020 geometry
1 -10.5 MULTIPOLYGON (((179.4813 51...
2 -25.8 MULTIPOLYGON (((-88.46866 3...
3 -28.4 MULTIPOLYGON (((-94.61792 3...
4 0.3 MULTIPOLYGON (((-114.7997 3...
5 29.8 MULTIPOLYGON (((-118.594 33...
6 13.9 MULTIPOLYGON (((-109.06 38....
Alternatively, we could have used a dplyr *_join function. We’ll use full_join. Note the special syntax in the by argument to address the fact that we want to merge the rows based on different columns in the data frames.
covid_us2 <- full_join(us_sf, covid_20210307, by = c("STUSPS" = "state_abb"))
head(covid_us2)Simple feature collection with 6 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -124.5524 ymin: 25.84012 xmax: -80.84313 ymax: 47.05468
Geodetic CRS: NAD83
STATEFP STATENS GEOIDFQ GEOID STUSPS NAME LSAD ALAND
1 48 01779801 0400000US48 48 TX Texas 00 676686238592
2 06 01779778 0400000US06 06 CA California 00 403673296401
3 21 01779786 0400000US21 21 KY Kentucky 00 102266598312
4 13 01705317 0400000US13 13 GA Georgia 00 149485311347
5 55 01779806 0400000US55 55 WI Wisconsin 00 140292627460
6 41 01155107 0400000US41 41 OR Oregon 00 248630419895
AWATER state_name deaths cases population income hs bs
1 18982083586 Texas 44451 2686818 27885195 30069 80.7 25.9
2 20291770234 California 54124 3501394 39148760 31086 80.7 30.1
3 2384223544 Kentucky 4819 410709 4440204 26174 81.9 20.5
4 4419673221 Georgia 17906 1023487 10297484 28838 84.3 27.3
5 29343084365 Wisconsin 7106 621654 5778394 31604 90.1 26.3
6 6168960338 Oregon 2296 157079 4081943 29696 88.8 28.8
vote_diff_2020 geometry
1 -5.7 MULTIPOLYGON (((-106.6234 3...
2 29.8 MULTIPOLYGON (((-118.594 33...
3 -26.4 MULTIPOLYGON (((-89.54443 3...
4 0.2 MULTIPOLYGON (((-85.60516 3...
5 0.6 MULTIPOLYGON (((-86.93428 4...
6 16.6 MULTIPOLYGON (((-124.5524 4...
If a new row is added to the sf object without a corresponding geometry, then an empty geometry object is added for that row.
Plotting Simple Features
The power of the sf package is the ability to easily create plots of spatial data.
Plotting sfc Objects
To simply plot the geometry list column of an sf object, we can use:
st_geometryto extract the list column of simple feature geometries from thesfobject (this will be an object of classsfc).plotto plot thesfcobject.
plot(st_geometry(us_sf))Alternatively, we can directly extract the sfc component of the sf object using $, then plot the sfc object.
plot(us_sf$geometry)We can easily change aspects of the plotted sfc object (or an sf) object using the standard arguments:
axescan be set toTRUEto show the axesxlabandylabwill change the axis labelsxlimandylimcan be used to constrain the plotting regions.- Note that “W” longitude values are indicated using negative numbers, while “E” longitude values are positive numbers.
- Note that “N” latitude values in the northern hemisphere are positive numbers. “S” latitude values in the southern hemisphere are negative numbers.
Each geometry has specific arguments that the user can change (consider looking at the documentation for ?sf::plot.sf for details). In this case, we can change the fill color of the MULTIPOLYGON objects using the color argument and the border using the border argument.
plot(us_sf$geometry, axes = TRUE,
xlab = "longitude", ylab = "latitude",
col = "grey", border = "blue",
xlim = c(-125, -75),
ylim = c(22, 50))We can change the colors of the individual geometry objects with a little creativity. Let’s color all the “C states” (California, Colorado, Connecticut) a little differently than the other states.
# determine indices of C states
c_states <- startsWith(us_sf$STUSPS, "C")
# create a character vector of replicated "grey"
# values matching the number of rows in us_sf
mycol <- rep("grey", nrow(us_sf))
#change "grey" to "orange" for the c_states indices
mycol[c_states] <- "orange"
# create a character vector of replicated "yellow"
# values matching the number of rows in us_sf
myborder <- rep("yellow", nrow(us_sf))
#change "yellow" to "blue" for the c_states indices
myborder[c_states] <- "blue"
# create the customized plot
plot(st_geometry(us_sf), col = mycol, border = myborder, xlim = c(-125, -75),
ylim = c(22, 50))Customize Color for Subset of sf Object
Let’s look at a different approach to coloring a subset of the sf object differently from the rest of the sf object.
# draw all geometries
plot(st_geometry(us_sf), col = "grey", border = "yellow",
xlim = c(-125, -75), ylim = c(22, 50))
# plot subset of all geometries with different colors
# add = TRUE overlays new drawing on current graphic
plot(st_geometry(us_sf)[c_states],
col = "orange", border = "blue",
add = TRUE)Plotting Attributes of an sf Object
By default, use we can use the plot function to plot all the attributes of an sf object. In general, this isn’t very useful.
plot(covid_us)More likely, we will want to plot a single attribute (variable), which can be done by subsetting that variable in the sf object and plotting the subsetted object.
Let’s plot the land area of the states, excluding Alaska and Hawaii. First, we exclude the Alaska and Hawaii rows (and save the filtered object).
covid_us <- covid_us |> filter(!is.element(STUSPS, c("AK", "HI")))
plot(covid_us["ALAND"])Land area is directly related to the area of the state. It would be interesting to visualize the states that have the greatest percentage of water. Let’s create a new variable in covid_us that computes the percentage of the state that is water. We’ll then plot this variable (excluding Alaska and Hawaii)
covid_us <- covid_us |> mutate(prop_water = AWATER/(AWATER + ALAND))
plot(covid_us["prop_water"])Not surprisingly, coastal states and states on the Great Lakes have the highest percentage of water.
We can use the following commands to identify the states with the 6 largest proportions of water.
covid_us |> slice_max(prop_water, n = 6) |> select(NAME, prop_water)Simple feature collection with 6 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -90.41814 ymin: 24.49813 xmax: -69.92826 ymax: 48.21065
Geodetic CRS: NAD83
NAME prop_water geometry
1 Michigan 0.4146584 MULTIPOLYGON (((-84.61622 4...
2 Rhode Island 0.3308018 MULTIPOLYGON (((-71.63147 4...
3 Massachusetts 0.2608631 MULTIPOLYGON (((-70.27553 4...
4 Florida 0.2485776 MULTIPOLYGON (((-81.81169 2...
5 Maryland 0.2172144 MULTIPOLYGON (((-76.04621 3...
6 Delaware 0.2170690 MULTIPOLYGON (((-75.77379 3...
We can change the number of breaks in our color bar via the nbreaks.
We can change the colors in our color bar by specifying the pal argument.
- The
palargument takes a function that, when givenn, the number of desired colors, returns a vector of colors. - The
hcl.colors,rainbow,heat.colors,terrain.colors,topo.colors, andcm.colorsare all color palettes in base R that can be used to change the colors palettes. - The
hcl.colorspalette is particularly good, as it includes color palettesviridisandcividis(corrected viridis) that are particularly well-suited to displaying colors that are color-blind and can be understood when printed in black and white. - The
colorspacepackage also has a wide variety of palettes we might consider.
However, the hcl.colors function has a palette argument to specify the desired palette. To use an hcl.color palette when plotting an sf object, we need to build a custom palette that only requires the number of colors desired. We create those function for the viridis and cividis palettes below. We then see that it produces the desired results.
plot(covid_us["prop_water"], pal=terrain.colors, nbreaks = 5)viridis_pal <- function(n) hcl.colors(n, palette = "viridis")
cividis_pal <- function(n) hcl.colors(n, palette = "cividis")
# dirty approach to see colors in the palette
image(matrix(0:4), col = viridis_pal(5))image(matrix(0:4), col = cividis_pal(5))Let’s do some analysis of the actual COVID-19 data. Let’s create a new column for death_rate_100k, which is the number of confirmed and probable COVID-19 deaths per 100,000 persons. Let’s display the death rate using the viridis palette and then the cividis palette.
covid_us <- covid_us |> mutate(death_rate_100k = deaths/population*100000)
plot(covid_us["death_rate_100k"],
nbreaks = 5,
pal = viridis_pal)plot(covid_us["death_rate_100k"],
nbreaks = 5,
pal = cividis_pal)Plotting sf Objects with ggplot2
The ggplot2 package includes a geometry for sf objects, geom_sf, the is typically adequate for using ggplot2 to produce graphics for sf objects.
- The
sfobject is passed as thedataargument to theggplotfunction. geom_sfis the geometry of thedata- The
XYcoordinates of thesfobject are automatically mapped toxandyaesthetics. - The
color,linetype,fill, etc., of the geometry objects can be changed by mapping an attribute of thesfobject to the appropriate aesthetic.
If we only want to plot the geometry list column (i.e., the geometry objects) of each observation, then we don’t have to specify any aesthetics.
ggplot(covid_us) + geom_sf()We will create a choropleth map of the covid_us data.
- A choropleth map is a map of regions colored to indicate the level of some variable associated with the regions.
ggplot(covid_us, aes(fill = death_rate_100k)) +
geom_sf()Changing the color palette for our fill aesthetic to a viridis palette.
ggplot(covid_us, aes(fill = death_rate_100k)) +
geom_sf() +
colorspace::scale_fill_continuous_sequential(palette = "viridis")Change the color palette to viridis using ggplot2’s built-in function.
ggplot(covid_us, aes(fill = death_rate_100k)) +
geom_sf() +
scale_fill_viridis_c(option = "viridis")Reverse the order of the colors.
ggplot(covid_us, aes(fill = death_rate_100k)) +
geom_sf() +
scale_fill_viridis_c(option = "viridis", direction = -1)Use the cividis palette using the scale_fill_viridis function from the viridis package.
ggplot(covid_us, aes(fill = death_rate_100k)) +
geom_sf() +
viridis::scale_fill_viridis(option = "E")Additional Resources
sf help and tutorials from the package authors https://r-spatial.github.io/sf/
Geocomputation with R by Robin Lovelace https://r.geocompx.org/. This book covers tons of aspect of spatial data (not just from an R perspective, but general theory and concepts). This will help you to learn a lot more about representing and working with spatial data in general, not just working with it in R.
Spatial Data Science with applications in R by Edzer Pebesma and Roger Bivand https://r-spatial.org/book/. Also covers much much about representing spatial data than you probably ever thought possible! The authors are the main creators of the sf package.